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Abstract 

A novel framework of compressed sensing, namely statistical compressed sensing (SCS), that aims 
at efficiently sampling a collection of signals that follow a statistical distribution, and achieving accurate 
reconstruction on average, is introduced. SCS based on Gaussian models is investigated in depth. For 
signals that follow a single Gaussian model, with Gaussian or Bernoulli sensing matrices of ^{k) 
measurements, considerably smaller than the ff{klog(N/k)) required by conventional CS based on sparse 
models, where is the signal dimension, and with an optimal decoder implemented via linear filtering, 
significantly faster than the pursuit decoders applied in conventional CS, the error of SCS is shown tightly 
upper bounded by a constant times the best A:-term approximation error, with overwhelnoing probability. 
The failure probability is also significantly smaller than that of conventional sparsity-oriented CS. Stronger 
yet simpler results further show that for any sensing matrix, the error of Gaussian SCS is upper bounded 
by a constant times the best A:-term approximation with probability one, and the bound constant can be 
efficiently calculated. For Gaussian mixture models (GMMs), that assume multiple Gaussian distributions 
and that each signal follows one of them with an unknown index, a piecewise linear estimator is introduced 
to decode SCS. The accuracy of model selection, at the heart of the piecewise linear decoder, is analyzed 
in terms of the properties of the Gaussian distributions and the number of sensing measurements. A 
maximum a posteriori expectation-maximization algorithm that iteratively estimates the Gaussian models 
parameters, the signals model selection, and decodes the signals, is presented for GMM-based SCS. In 
real image sensing applications, GMM-based SCS is shown to lead to improved results compared to 
conventional CS, at a considerably lower computational cost. 

I. Introduction 

Compressed sensing (CS) aims at achieving accurate signal reconstruction while sampling signals at 
a low sampling rate, typically far smaller than that of Nyquist/Shannon. Let x G be a signal of interest, 
^ G j^A^xw ^ non-adaptive sensing matrix (encoder), consisting of M <^N measurements, y = ^>x € M/^ 
a measured signal, and A a decoder used to reconstruct x from ^x. CS develops encoder-decoder pairs 
(^,A) such that a small reconstruction error x — A(^x) can be achieved. 

Reconstructing x from <I>x is an ill-posed problem whose solution requires some prior information on 
the signal. Instead of the frequency band-limit signal model assumed in classic Shannon sampling theory. 
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conventional CS adopts a sparse signal model, i.e., there exists a dictionary, typically an orthogonal 
basis *F G M'^^^, a linear combination of whose columns generates an accurate approximation of the 
signal, X *Fa, the coefficients a[m], I <m < N, having their amplitude decay fast after being sorted. 
For signals following the sparse model, it has been shown that using some random sensing matrices such 
as Gaussian and Bernoulli matrices <I> with M = 0' {klog{N / k)) measurements, and an Zi minimization 
or a greedy matching pursuit decoder A promoting sparsity, with high probability CS leads to accurate 
signal reconstruction: The obtained approximation error is tightly upper bounded by a constant times 
the best fc-term approximation error, the minimum error that one may achieve by keeping the k largest 
coefficients in a |[T2| . fl3l . lITTl . ifTSl . Redundant and signal adaptive dictionaries that further improve 
the CS performance with respect to orthogonal bases have been investigated ifTTI . |[T9l . IIBTI . In addition 
to sparse models, manifold models have been considered for CS as well O, fl6l . 

The present paper introduces a novel framework of CS, namely statistical compressed sensing (SCS). 
As opposed to conventional CS that deals with one signal at a time, SCS aims at efficiently sampling 
a collection of signals and having accurate reconstruction on average. Instead of restricting to sparse 
models, SCS works with general Bayesian models. Assuming that the signals x follow a distribution with 
probability density function (pdf) /(x), SCS designs encoder-decoder pairs (<I>,A) so that the average 
error 



where || • ||x is a norm, is small. As an important example, SCS with Gaussian models is here shown 
to have improved performance (bounds) relative to conventional CS, the signal reconstruction calculated 
with an optimal decoder A implemented via a fast linear filtering. Moreover, for Gaussian mixture models 
(GMMs) that better describe most real signals, SCS with a piecewise linear decoder is investigated. 

The motivation of SCS with Gaussian models is twofold. First, controlling the average error over a 
collection of signals is useful in signal acquisition, not only because one is often interested in acquiring 
a collection of signals in real applications, but also because more effective processing of an individual 
signal, an image or a sound for example, is usually achieved by dividing the signal in (often overlapping) 
local subparts, patches (see Figure [T0|) or short-time windows for instance, so a signal can be regarded as 
a collection of subpart signals IH, ||8l, ll35l . |[36l . In addition, Gaussian mixture models (GMMs), which 
model signals or subpart signals with a collection of Gaussians, assuming each signal drawn from one 
of them, have been shown effective in describing real signals, leading to state-of-the-art results in image 
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inverse problems [36] and missing data estimation Il24l . 

SCS based on a single Gaussian model is first developed in Section Following a similar mathe- 
matical approach as the one adopted in conventional CS performance analysis [17], it is shown that with 
the same random matrices as in conventional CS, but with a considerably reduced number M = ff{k) of 
measurements, and with the optimal decoder implemented via linear filtering, significantly faster than the 
decoders applied in conventional CS, the average error of Gaussian SCS is tightly upper bounded by a 
constant times the best ^-term approximation error with overwhelming probability, the failure probability 
being orders of magnitude smaller than that of conventional CS. Moreover, stronger yet simpler results 
further show that for any sensing matrix, the average error of Gaussian SCS is upper bounded by a constant 
times the best ^-term approximation with probability one, and the bound constant can be efficiently 
calculated. 



Section III extends SCS to GMMs. A piecewise linear GMM-based SCS decoder, which essentially 
consists of estimating the signal using each Gaussian model included in the GMM and then selecting the 
best model, is introduced. The accuracy of the model selection, at the heart of the scheme, is analyzed in 
detail in terms of the properties of the Gaussian distributions and the number of sensing measurements. 
These results are then important in the general area of model selection from compressed measurements. 



Following 1 36], Section |IV] presents an maximum a posteriori expectation-maximization (MAP-EM) 
algorithm that iteratively estimates the Gaussian models and decodes the signals. GMM-based SCS 
calculated with the MAP-EM algorithm is applied in real image sensing, leading to improved results 
with respect to conventional CS, at a considerably lower computational cost. 

II. Performance Bounds for a Single Gaussian Model 

This section analyzes the performance bounds of SCS based on a single Gaussian model. Perfect 
reconstruction of degenerated Gaussian signals is briefly discussed. After reviewing basic properties of 
linear approximation for Gaussian signals, the rest of the section shows that for Gaussian signals with 
fast eigenvalue decay, the average error of SCS using k measurements and decoded by a linear estimator 
is tightly upper bounded by that of best ^-term approximation. 

Signals x € are assumed to follow a Gaussian distribution in this section. Principal 

Component Analysis (PCA) calculates a basis change a = B^(x — /i) of the data x, with B the orthonormal 
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PCA basis that diagonalizes the data covariance matrix 

I = BSB^, (1) 

where S = diag(Ai, . . . , A/^) is a diagonal matrix whose diagonal elements Ai > A2 > . . . > Aa? are the 
sorted eigenvalues, and a ~ ^(0,S) the PCA coefficient vector L27il . In this section, for most of the 
time we will assume without loss of generality that x ~ ^(0,S) by looking in the PCA domain. For 
Gaussian and Bernoulli matrices that are known to be universal, analyzing CS in canonical basis or PCA 
basis is equivalent ||4l. 

A. Degenerated Gaussians 

Conventional CS is able to perfectly reconstruct ^-sparse signals, i.e., x G with at most k non- 
zero entries (typically k<^N), with 2k measurements lITTll . Degenerated Gaussian distributions ■yV{Q,Sk), 
where S^^ = diag(Ai, . . . , A^,0, . . . ,0) with at most k non-zero eigenvalues, give the counterpart of k- 
sparsity for the Gaussian signal models considered in this paper. Such signals belong to a linear subspace 

= {x|x[m] = 0, V/c <m < N}. The next lemma gives a condition for perfect reconstruction of signals 
in^,. 

Lemma I. If ^ is any M x N matrix and k is a positive integer, then there is a decoder A such that 
A(<I>x) = X, for all x G =5^^, if and only if nNull(<I>) = 0. 

Proof: Suppose there is a decoder A such that A(<I>x) = x, for all x^.^^. Let x = nNull(<I>). We 
can write x = xi — X2 where both xi,X2 G S^k- Since Ox = 0, <I>xi = <I>X2. Plugging <I>xi and <I>X2 into 
the decoder A, we obtain xi = X2 and then x = xi — X2 = 0. 

Suppose ^^nNull(<I>) = 0. If xi,X2 G S^k with Oxi =<I>X2, then xi -X2 G ^^:nNull(<I>), so xi =X2. 
is thus a one-to-one map. Therefore there must exist a decoder A such that A(<I>x) = x. ■ 

It is possible to construct matrices of size M xN with M = k which satisfies the requirement of the 
Lemma. A trivial example is [Imxm\^mx{n-m)]^ where ImxM is the identity matrix of size MxM and 
Omx{n-m) is a zero matrix of size M x {N — M). Comparing with conventional compressed sensing that 
requires 2k measurements for exact reconstruction of ^-sparse signals, with only k measurements signals 
in can be exactly reconstructed. Indeed, in contrast to the ^-sparse signals where the positions of 
the non-zero entries are unknown (^-sparse signals reside in multiple ^-dimensional subspaces), with the 
degenerated Gaussian model J^{Q,Sk), the position of the non-zero coefficients are known a priori to be 



February 18, 2011 



DRAFT 



5 



the first k ones, k measurements thus suffice for perfect reconstruction. 

In the following, we will concentrate on the more general case of non-degenerated Gaussian signals 
(i.e., Gaussians with full-rank covariance matrices Z) with fast eigenvalue decay, in analogy to compress- 
ible signals for conventional CS. As mentioned before and will be further experimented in this paper, 
such simple models not only lead to improved theoretical bounds, but also provide state-of-the-art image 
reconstruction results. 

B. Optimal Decoder 

To simplify the notation, we assume without loss of generaUty that the Gaussian has zero mean 
/I = 0, as one can always center the signal with respect to the mean. 

It is well-known that the optimal decoders for Gaussian signals are calculated with linear filtering: 

Theorem 1. ^ Let xgR^ be a random vector with prior pdf ^(0,1), and <I> € R^^^, M <N, be 
a sensing matrix. From the measured signal y = <t>x G M'^, the optimal decoder A that minimizes the 
mean square error (MSE) £'x[||x — A(<I>x)||2] = ming£'x[||x — g(<I>x)||2], as well as the mean absolute error 
(MAE) £x[||x-A(<I>x)||i] =ming£:x[||x-g(<I>x)||i], where g : ^ , is obtained with a linear MAP 
estimator, 

A(<I>x) = argmaxp(x|y) = r<I>^(<I>r<I>^)"' (Ox), (2) 

X V ' 

A 

and the resulting error T] = x — A(<I>x) is Gaussian with mean zero and with covariance matrix Y.-q = 
E^[r]r]'^]=Z-Z^^{^Z^^)^^l., whose trace yields the MSE of SCS. 

£x[||x-A(<i>x)||^] = rr(r-r<i>^(<i>r<i>^)"i<i>r). (3) 

In contrast to conventional CS, for which the l\ minimization or greedy matching pursuit decoders, 
calculated with iterative procedures, have been shown optimal under certain conditions on <I> and the signal 
sparsity lITll . |[T8l . Gaussian SCS enjoys the advantage of having an optimal decoder ([2]) calculated fast 
via a closed-form linear filtering for any <I>. 

Corollary 1. If a random matrix <I> G is drawn independently to sense each x, with all the other 

conditions as in Theorem |7] the MSE of SCS is 

£x,<i>[||x-A(<I>x)||i] =£ci.[rr(I-EO^(<I>r<I>^)-i<I>r)]. (4) 

Applying an independent random matrix realization to sense each signal has been considered in fTTj . 
In real applications, these random sensing matrices need not to be stored, since the decoder can regenerate 
them itself given the random seed. 
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Following a PCA basis change ([T]), it is equivalent to consider signals x ^(/x,E) and x ^(0,S), 
where S = diag(Ai, . . . is a diagonal matrix whose diagonal elements Ai > A2 > . . . > Aa? are the 
sorted eigenvalues. Theorem [T] and Corollary [T] clearly hold for x ~ ^(0,S), with (|2]), Q, and Q 
rewritten as 



A(<I>x) = SO' (<I>S<I>')"H*J>x), (5) 

^ V ' 

A 

£x[||x-A(<I>x)||2] = rr(S-S<I>^(<I>S<I>^)-i<I>S), (6) 
£x,<i>[||x-A(<I>x)||^] = £ci>[rr(S-S<I>^(<I>S<I>^)"^<I>S)]. (7) 

Note that PCA bases, as sparsifying dictionaries, have been applied to do conventional CS based on 
sparse models [29], which is fundamentally different than the Gaussian models and SCS here studied. 

C. Linear vs Nonlinear Approximation 

Before proceeding with the analysis of the SCS performance, let us make some comments on the 

relationship between linear and non-linear approximations for Gaussian signals. In particular, the following 

is observed via Monte Carlo simulations: 

For Gaussian signals x ^ ^(0, S), where S = diag(Ai , . . . , A/^) whose eigenvalues X\ > A2 > . . . > Aa? 
decay fast, the best k-term linear approximation 

_ / \<m<k, 
''^^'"^-\ k^\<m<N, 

and the nonlinear approximation 

x^ = r,(x), (9) 

where is a thresholding operator that keeps the k coefficients of largest amplitude and setting others 
to zero, lead to comparable approximation errors 

ct/({x})x=£x[||x-x[|W and (J,«({x})x = £x[||x-x^|W. (10) 



Monte Carlo simulations are performed to test this. Assuming a power decay of the eigenvalues lITTll . 

A„, = m"", l<m<A^, (11) 
where a > is the decay parameter, with = 64, Figure [T] plots the MSEs 

CTi({x})2=£x[||x-x[||2] and a^X{x})i = £x[||x - x^'||^], (12) 
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normalized by the ideal signal energy ||x||2, of best k-teim linear and nonlinear approximations as a 
function of a, with typical (for image patches of size 8 x 8 for example) k values 8 and 16 (k/N =1/8 
and 1/4). Both MSEs decrease as a increases, i.e., as the eigenvalues decay faster. With typical values 
a 3 (similar to the eigenvalue decay calculated with typical image patches) and k = 8 or 16, both 
approximations are accurate and generate small and comparable MSEs, their difference being about 
0.1% of the signal energy and ratio about 2. 




(a) (b) (c) 

Fig. 1. (a). MSEs (normalized by the ideal signal energy) of best fc-term linear and non-linear approximation, with fc = 8 
and 16 (signal dimension A' = 64). (b) and (c) Difference and ratio of normalized MSEs of best A;-term linear and non-linear 
approximation shown in (a). 



Following this, the error of Gaussian SCS will be compared with that of best ^-term linear approx- 
imation, which is comparable to that of best ^-term nonlinear approximation. For simplicity, the best 
k-term linear approximation errors will be denoted as 

a,{{x})x = cjl{{x})x and CT,({x})2 = (T/({x})i (13) 

Note that at{{x})l = I^JJ^^+j A^. 

D. Performance of Gaussian SCS — A Numerical Analysis At First 

This section numerically evaluates the MSE of Gaussian SCS, and compares it with the minimal 
MSE generated by best k-term linear approximation, proceeding the theoretical bounds later developed. 



As before, a power decay of the eigenvalues (11), with = 64, is assumed in the Monte Carlo 



simulations. An independent random Gaussian matrix realization <I> is applied to sense each signal x |17|. 

Figures |2] (a) and (c)-top plot the MSE (normalized by the ideal signal energy) of SCS and that of 
the best ^-term linear approximation, as well as their ratio as a function of a, with k fixed at typical 
values 8 and 16 {k/N = 1/8 and 1/4). As a increases, i.e., as the eigenvalues decay faster, the MSEs for 
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both methods decrease. Their ratio increases almost Unearly with a. The same is plotted in figures [2] (b) 
and (c)-bottom, with eigenvalue decay parameter fixed at a typical value a = 3, and with k varying from 
5 to 32 {k/N from 5/64 to 1/2). As k increases, both MSEs decrease, their ratio being almost constant 
at about 3.7. 



Best-k, k=8 

— SCS, k=8 

— Best-k, k=16 

— SCS, k=16 





— Best-k 

— SCS 


A 


— k=8 

— k=16 




1 i 






(a) 



(b) 



(c) 



Fig. 2. Comparison of the MSB of SCS and tliat of tfie best fc-term linear approximation for Gaussian signals of dimension 
A' = 64. (a) and (c)-top. The MSB (normalized by the ideal signal energy) of SCS and that of best fe-term linear approximation, 
as well as their ratio as a function of a, with k fixed at typical values 8 and 16. (b) and (c)-bottom. The same values, with 
eigenvalue decay parameter fixed at a typical value a = 3, and with k varying from 5 to 32. 



These results indicate a good performance of Gaussian SCS, its MSE is only a small number of 
times larger than that of the best fc-term linear approximation. The next sections provide mathematical 
analysis of this performance. 

Let us notice that while the best k-temv linear approximation decoding is feasible for signals following 
a single Gaussian distribution, it is impractical with GMMs (assuming multiple Gaussians and that each 
signal is generated from one of them with an unknown index), since the Gaussian index of the signal 
is unknown. SCS with GMMs, which describe real data considerably better than a single Gaussian 



model 11361 . will be described in sections III and IV 



E. Performance Bounds 

Following the analysis techniques in ifTTll . this section shows that with Gaussian and Bernoulli random 
matrices of ^{k) measurements, considerably smaller than the 0" {k\og{N / k)) required by conventional 
CS, the average error of Gaussian SCS is tightly upper bounded by a constant times the best ^-term linear 

'simulations using the same coefficient energy power decay model \\ \\ show that the ratio between conventional CS based 
on sparse models, with k measurements, and that of the best fc-term nonlinear approximation, varies as a function of the decay 
parameter a and k. Bor typical values a = 3, the ratio is typically an order of magnitude larger than that between the MSB of 
SCS and that of the best A;-term linear approximation. 
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approximation error with overwhelming probabiUty, the failure probability being orders of magnitude 
smaller than that of conventional CS. 

We consider only the encoder-decoder pairs (<I>,A) that preserve Ox, i.e., <I>(A(<I>x)) = Ox, satisfied 
by the optimal A in Q for Gaussian signals x, VO. 

1 ) From Null Space Property to Instance Optimality: The instance optimality in expectation bounds 



the average error of SCS with a constant times that of the best ^-term linear approximation ( [T3| ), defining 
the desired SCS performance: 

Definition 1. Let x £ be a random vector that follows a certain distribution. Let K C {1, • • • ,A^} be 
any subset of indices. We say that (<I>,A) is instance optimal in expectation in K in \\ ■ \\x, with a constant 
Co, if 

£^,(<i,)[||x-A(<I>x)||x] <Co£x[||x-x^||x], (14) 

where Xk is the signal x restricted to K fx/f [/i] = x[?i], y n ^ K, and otherwise), the expectation on 
the left side considered with respect to x, and to O ;/ one random O is drawn independently for each x. 
Similarly, the MSB instance optimality in K is defined as 

£x,(o)[||x-A(<I>x)||i] <Co£x[||x-x^||i]. (15) 

In particular, if K = {\, . . . ,k}, then we say that (O, A) is instance optimal in expectation of order k 
in \\-\\x, with a constant Co, if 

£^,(<i>)[||x-A(<I>x)||x] <Co£x[||x-x^||x] =CoCT,({x})x, (16) 

and is instance optimal of order k in MSB, with a constant Co, if 

£x,(<i.)[||x-A(<I>x)||^] <Co£x[||x-x^||i] =Coa,({x})i (17) 

The null space property in expectation defined next will be shown equivalent to the instance optimality 
in expectation. 

Definition 2. Let x € be a random vector that follows a certain distribution. Let K C {1, • • • ,A^} be 
any subset of indices. We say that <I> in (<I>,A) has the null space property in expectation in K in \\ ■ \\x, 
with constant C, if 

E^,{f)l-\\^\\x]<CE^[\\V-T]K\\x], where T]=x-Ai<^x), (18) 

where tIk is the signal T] restricted to K (riK[n] = ri[n], V n G K, and otherwise), the expectation 
considered on the left side with respect to x, and to <I> ;/ one random <I> is drawn independently for each 
X. Note that r\ € Null(<I>). Similarly, the MSB null space property in K is defined as 

E^,m\\^\\l<CBy,[\\^-^K\\ll where ri =x-A{^x). (19) 

In particular, if K = {I, . . . ,k}, with 1 < k <N, then we say that <I> in (<I>,A) has the null space 
property in expectation of order k in || • \\x, with constant C, if 

B^^(^^)\\ri\\x <CB^[\\ri -riK\\x]=COki{ri})x, where ri =x-A(<I>x), (20) 
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and has the MSE null space property of order k, if 

Ex,{ct>)\\ri\\l<CESri-riKf2]=Ccjt{{ri})i where ri =x- A{^x). (21) 

Theorem 2. Let x € be a random vector that follows a certain distribution. Given an M xN matrix 
<I>, a norm \\-\\x, <^nd a subset of indices KG {1, • • • ,A^}, a sufficient condition that there exists a decoder 
A such that the instance optimality in expectation in K in \\-\\x ( |14[ ) holds with constant Co, is that the 
null space property in expectation ( |18[ ) holds with C = Cq/2 for this (<I>,A); 

E^X'b)\M\x] < -^ESTl-riK\\x], where 77 = x - A(<I>x). (22) 



A necessary condition is the null space property in expectation ( 18 1 with C = Cq: 

£x,(<i>)[||T7lW <Co£x[||T7-T]jf||x], where T7 =x-A{^x), (23) 
Similar results hold between the MSE instance optimality in K ( 15 1 and the null space property \19\ , 



with the constant C = Co/4 in the sufficient condition. 

In particular, if K = {I, . . . ,k}, with I <k<N, the same equivalence between the instance optimality 



in expectation of order k in \\-\\x (|16|) and the null space property in expectation (20), and that between 



the MSE instance optimality of order k ill) and the null space property (21 1, hold as well. 



Proof: To prove the sufficiency of ( [22] ), we consider the decoder A such that for all y = Ox G M/^, 

A(y) := argmin||z-z^||x s.t. Oz = y. (24) 

z 

By ([22]), we have 



£||x-A(<I>x)||z < ^£x[||(x-A(<I>x))-(xj,-(A<I>x)j,)||x] (25) 

< ^{E^[\\x-xk\\x]+E^[\\A{^x)-{A^x)k\\x]) (26) 

< CoE^[\\x-xk\\x], 

where the second inequality uses the triangle inequality, and the last inequahty follows from the choice 
of the decoder (|24l). 



To prove the necessity of (23 1, let A be any decoder for which ( 14i holds. Let 77 = x — A(<I>x) and let 
rjK be the linear approximation of T7 in ^ (T7/s:["] = J7['^]5 ^ n £ K, and otherwise). Let rjK = ^71 + ^72 
be any splitting of rjK into two vectors in the linear space J^k = {x|x[m] = 0,\/k ^ K}. We can write 

ri = rii + ri2 + ri3, 



with T73 = T7 — rjK. As the right side of ([4\ is equal to for V x € ^k, we deduce — T71 = A(<I>(— 171)). 
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Since rj gNu11(<I>), we have <!>(— T]i) = <I>(t]2 + T]3), so that — T]i = A(<I>(t]2 + T]3)). We derive 

E[\\ri\\x] = £[||r72 + T?3-A<I>(T72 + 773)||x]<Co£x[||(r72 + r73)-((T72k + (T73k)||z] 
= CoE^[\\ri-riK\\x], 



where the inequality follows from (14i, and the second and third equalities use the fact that Tj = rii + 



T72 + T]3 and T]i G Thus we have obtained ( |23l ). 



A similar proof proceeds for MSE instance optimality and null space property. The second part of 
the theorem is a direct consequence of the first part. ■ 

Comparing to conventional CS that requires the null space property to hold with the best 2^-term 
nonlinear approximation error |17], the requirement for Gaussian SCS is relaxed to k, thanks to the 
linearity of the best k-teim linear approximation for Gaussian signals. 

Theorem |2] proves the existence of the decoder A for which the instance optimality in expectation 
holds for (<I>,A), given the null space property in expectation. However, it does not explain how such 
decoder is implemented. The following Corollary, a direct consequence of theorems [T] and [2j shows that 
for Gaussian signals the optimal decoder ^ leads to the instance optimality in expectation. 

Corollary 2. For Gaussian signals x ^(0,S), if an M xN sensing matrix <I> satisfies the null space 
property in expectation ( |20| ) of order k in || • ||i, with constant Co/2, or the MSE null space property ( |21| ) 
of order k with constant Co/4, then the optimal and linear decoder A = S<I>^(<I>S<I>^)^^ satisfies the 
instance optimality in expectation ( 16l in || • ||i, or the MSE instance optimality ( |17| l. 



Proof: It follows from Theorem [T] that the MAP decoder minimizes MAE and MSE among all the 
estimators for x~ c/K(0,S). Therefore its MAE and MSE are smaller than the ones generated by the 
decoder considered in Theorem |2 (24i. The latter satisfies the instance optimality, so is the former. ■ 



2) From RIP to Null Space Property: The Restricted Isometry Property (RIP) of a matrix measures 
its ability to preserve distances, and is related to the null space property in conventional CS HlM . |[T8]| . 
The new linear RIP of order k restricts the requirement of conventional RIP of order ^ to a union of 
/c-dimensional linear subspaces with consecutive supports: 

Definition 3. Let k<N be a positive integer Let ,Jti define a linear subspace of functions with support 
in the first k indices in [1,A^], a linear subspace of functions with support in the next k indices, and 
so on. The functions in the last linear subspace J(fj defined this way may have support with less than k 
indices. An M xN matrix <I> is said to have linear RIP of order k with constant 5 if 

(l-5)||x||2 < ||<I>x||2 < (l + 5)||x||2, VxGUj^i^-. (27) 
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The linear RIP is a special case of the block RIP 11211 . with block sparsity one and blocks having 
consecutive support of the same size. 



The following theorem relates the linear RIP (27i of a matrix <I> to its null space property in 



expectation (20 1. 



Theorem 3. Let x G be a random vector that follows a certain distribution. Let ^ be an M x N 
matrix that satisfies the linear RIP of order 2k with 5 < I, and let A be a decoder Let T] = x — A(<I>x). 
Assume further that £x,(<l)) I T ['^] I decays in n: E^^{<^) \Tl[n+l] \ < £^x,(<l>) I ^ ['^] I- ^n < N — I. Then <I> satisfies 
the null property in expectation of order k in || • ||i (20 1, with constant Co = 1 |^ 



Proof: Let K denote the set of first k indices of the entries in T\, K\ the next k indices, K2 the next 
k indices, etc. We have 

J 

\\^Kh < ||T]/fU/fil|2< (1-5)"^ ||<I>T7/fu/fil|2 = (1- 5)^^11 ^<I>T7if.||2 

;=2 

< (i-5)-i£||<I>T7;f.||2<(l + 5)(l-5)-i£||T7^,||2, 

where the second and last inequalities follow the linear RIP property of <I>, the third inequality follows 

from the triangle equality, and the equality holds since T] G Null(<I>). Hence we have 

J 

E\\^Kh<{l+8){\-5)-'Y.E\\^Ki\\2. (28) 

i=2 

Since £'|T][?i + fc]| <£'|t7[«]|, we derive £'||T7if.^j ||i < £||T7/fy||i, so that 

E\\^Kj+^\2<E\\^KjJU<E\\r]KJ\\u (29) 
where the first inequality follows from the fact that ||x||2 < ||x||i, Vx. Inserting (29 1 into (28 1 gives 



£||r7^||2 < (l + 5)(l-5)-i ££||t7^,||i < (1 + 5)(1 - 5)-i£||%c||i. (30) 

7=1 

By the Cauchy-Schwartz inequality ||T7/f||i < ^^^^| 1^7/^1 12> we therefore obtain 

£||T7||i =E\\^Kh+E\\^Kch < (i+^'^^Y3|)^ll%c||i, (3i) 

which verifies the null space property with constant Cq. ■ 
For Gaussian signals x G ^(0,S), with <I> Gaussian or Bernoulli matrices, one realization drawn 

^As in 1171 . the result here is in the li norm, while in the next section we will consider a natural extension of the RIP for SCS 
which can be studied in the I2 norm, something possible for conventional CS only in a probabilistic setting, with one random 
sensing matrix independently drawn for each signal 1171 . 
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independently for each x, and with A the optimal decoder ([5]), the decay of Sx.^ | T] [«] | assumed in 
Theorem [3] is verified through Monte Carlo simulations. 

3) From Random Matrices to Linear RIP: The next Theorem shows that Gaussian and Bernoulli 
matrices satisfy the conventional RIP for one subspace with overwhelming probability. The linear RIP 
will be addressed after it. 

Theorem 4. /[7]/, ^ Let <t> be a random matrix of size M x N drawn according to any distribution that 
satisfies the concentration inequality 

Pr(|||<I>x||^-||x||^| >e||x||^) <2e-^^<'(^/2)^ V x G R^, (32) 

where < 5 < 1, and co{5 /2) > is a constant depending only on 5/2. Then for any set K . . ,N} 

with \K\ = k <M, we have the conventional RIP condition 

(l-5)||x||2 < ||<I>x||2 < (l + 5)||x||2, VxG^jf, (33) 

where S^k the set of all vectors in M.^ that are zero outside of K, with probability greater than or equal 
to 1 — 2(12/5)*e^'^"'^/^^''^. Gaussian and Bernoulli matrices satisfy the concentration inequality ( |32[ ). 



The linear RIP of order k pT] ) requires that ( [33] ) holds for N /k < N subspaces. The next Theorem 
follows from Theorem |4] by simply multiplying by the probability that the RIP fails to hold for one 
subspace. 

Theorem 5. Suppose that M, N and < 5 < 1 are given. Let ^ be a random matrix of size M xN drawn 



according to any distribution that satisfies the concentration inequality ( |32[ ). Then there exist constants 
ci,C2 > depending only on 5 such that the linear RIP of order k ( |27[ ) holds with probability greater 
than or equal to 1 —2Ne^'^^^ for <t> with the prescribed 5 and k < ciM. 

Proof: Following Theorem |4| for a ^-dimensional linear space the matrix <I> will fail to 
satisfy ([33]) with probability < 2{l2/5fe-''''^^/^>. 



The linear RIP requires that ([33]) holds for at most such subspaces. Hence ( |33] ) will fail to hold 
with probability 

< 2N{l2/dfe-'o^^^^'^'^ = 2A^f.-^o(5/2)M+<:iog(i2/5)_ (-34^ 



Thus for a fixed ci > 0, whenever k < c\M, the exponent in the exponential on the right side of ( |34| ) 
is < C2M provided that C2 < co(5/2) —c\{\ +log(12/5)). We can always choose ci > small enough 
to ensure C2 > 0. This proves that with a probability 1 — 2Ne^'^^'^, the matrix <I> will satisfy the linear 



RIP (27 1. 



Comparing with conventional CS, where the null space property requires that the RIP (33 1 holds 
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for (, ) subspaces H, |[T4l . ifTSl . the number of subspaces in the Unear RIP ([27]) is sharply reduced to 



N /k for Gaussian SCS, thanks to the coefficients pre-ordering and the linear estimation in consequence. 
Therefore with the same number of measurements M, the probability that a Gaussian or Bernoulli matrix 
<I> satisfies the linear RIP is substantially higher than that for the conventional RIP. Equivalently, given the 
same probability that <I> satisfies the linear RIP or the conventional RIP of order k, the required number of 
measurements for the linear RIP is M ^{k), substantially smaller than the M ~ ff{klog{N /k)) required 
for the conventional RIP. Similar improvements have been obtained with model-based CS that assumes 
structured sparsity on the signals JH. 

With the results above, we have shown that for Gaussian signals, with sensing matrices satisfying 
the linear RIP pT] ) of order 2k, for example Gaussian or Bernoulli matrices with ^{k) rows, with 
overwhelming probability, and with the optimal and linear decoder (|5]l, SCS leads to the instance 
optimality in expectation of order k in ( fT6| ), with constant Cq = 2(1 + k^^^j^)- k^^^ is typically 
small by the definition of CS. 



F. Performance Bounds with RIP in Expectation 

This section shows that with an RIP in expectation, a matrix isometry property more adapted to SCS, 
the Gaussian SCS MSE instance optimality ( fTT] ) of order k and constant Co, holds in the h norm with 
probability one for any matrix. Co has a closed-form and can be easily computed numerically. 

Definition 4. Let x G be a random vector that follows a certain distribution. Let ^ be an M x N 
sensing matrix and let A. be a decoder Let r\ =x — A(<I>x). in (<I>,A) is said to have RIP in expectation 
in K with constant ck if 

Ex,{t>)\\^riK\\l = CKE^^(^^)\\riK\\l, where T] = x - A(<I>x), (35) 

where A'c{l,...,A^}, TjK ^ is the signal rj restricted to K (riK[n] = ri[n], V ?i G K, and otherwise ), 
and the expectation is with respect to x, and to <I> ;/ one random is drawn independently for each x. 

The conventional RIP is known to be satisfied only by some random matrices, Gaussian and Bernoulli 
matrices for example, with high probability. For a given matrix, checking the RIP property is however 
NP-hard p^. By contrast, the constant of the RIP in expectation ( [35] ) can be measured for any matrix via 
a fast Monte Carlo simulation, the quick convergence guaranteed by the concentration of measure |l33l. 
The next proposition, directly following from ([6]) and (|7]l, further shows that for Gaussian signals, the 
RIP in expectation has its constant in a closed form. 
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Proposition 1. Assume x ^(0,S), ^ is an M xN sensing matrix and A is the optimal and linear 
decoder ([5]). Then in (<I>,A) satisfies the RIP in expectation in K, 

) [rr (4>R;f SR J*^ - 4>R;f SO^ (OSO^ ) " ' OSR ) ] 
= CK{E<t,) [Tr (R/fSRj - R;fS4>^(<I>S4>^)"'4>SRj)] , (36) 

where Rjf is an N xN extraction matrix giving rj^ = Rjf^. ^K{i,i) = 1. V/ G K, all the other entries 
being zero. The expectation with respect to <I> is calculated if one random <I> is drawn independently for 
each X. 



Proof: Let t] =x — AOx = x — S<I>^(<I>S<I>^)"^<I>x, which follows from the MAP estimation ( |36l ) 
is derived by calculating the covariance matrices ^^r]^ = ^ of ^'{]k = ^^K'f], and 

Ztj^ =£ [Ra:T7(Ra:T7)^] of r]K = Rjf^7> and using the fact that the trace of a covariance matrix yields the 
average energy of the underlying random vector. ■ 

The next Theorem shows that the RIP in expectation leads to the MSB null space property holding 
in equality. 

Theorem 6. Let x € be a random vector that follows a certain distribution, ^ an M x N sensing 
matrix, and A a decoder. Assume £x.(<l>) ll^Jf II2 ^ andE^^j^^-^ II^a:c||2 7^ 0, /or some K<z{\,... ,N}. Assume 
that in A) has the RIP in expectation in K with constant ax > 0, and in = {1, . . . ,N}\K with 
constant bx > 0.' 

gx,(o)ll<^>r7dli ^^^^ gx.(<.)ll^%c|li ^^^^ w/..r.r7=x-AOx, (37) 

^x,(<I>)ll^/f II2 ^x,(<I>)ll%c|l2 

where K C {1,...,A^}, and rjn G is the signal rj restricted to K ('{\K\n\ = '{\\n\, V « G K, and 
otherwise). Then <I> satisfies 

1 1 1 1 2 1 1 2 

^x,(<I>)ll^ll2 = C'o£x,(<i>)ll%c||2, (38) 

where Cq = \ + bK/ ok- In particular, if K = {\,. . . ,k}, with 1 <k<N, then <I> satisfies the MSE null 
space property of order k, which holds with equality, 

E.,mM\l = CoOk{{r]])l (39) 

Proof: We derive ( [38] ) by 

^x,(<I>)Nll2 _ ^ _|_ ^x,(tl>)ll%ll2 _ ^ _^ ^x,(<I>)ll*J*%ll2M 
^x,(<I>)ll%c|l2 ^x,(<I>)ll%c|l2 ^x,($)ll*J*%c||2/^A: ' 



where the second equality follows from the RIP in expectation ( |37) and the last equality holds because 
^r\K = ^Tlj^c since rj = riK + %c G Null(<I>). ([39]) is obtained by inserting ([13]) in ([38]). ■ 

Following Corollary [2] the MSE null space property constant Co indicates the upper bound of the SCS 
reconstruction error relative to the best fc-term linear approximation. Let us check Co of different sensing 
matrices in SCS for Gaussian signals x G ~ ^(0,S), assuming that the eigenvalues of S follow a 
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power decay ( [TT] ) with typical values a = 3 and = 64. Gaussian, Bernoulli and random subsampling 
matrices <I> of size M xN are considered, and the optimal and linear decoder A ([5]) is applied to reconstruct 
the signals. For each matrix distribution, a different random matrix realization <I> is applied to sense each 
signal X. Note that since the random subsampling matrix <I>, each row containing one entry with value 1 
at a random position and otherwise, has the maximal coherence with the canonical basis, this matrix 
is not suitable for directly sensing x lITOl . and is replaced by <I>*F in the simulation, with *P a DCT basis 
having low coherence with <I>. 



Monte Carlo simulations are performed to calculate the RIP constants and (37). Figure [3] 

(a) plots Co = 1 +bK/aK, with a typical value = 10 {k/N = 5/32), for different values of M. When 
the number M of SCS measurements increases, the reconstruction error of SCS decreases, resulting in 
a smaller ratio over the best ^-term linear approximation error with a fixed k. Gaussian and Bernoulli 
matrices lead to similar Co values, slightly smaller than that of random subsampling matrices. Figure [3] 

(b) plots Cq, as a function of k, with M = k. Gaussian and Bernoulli matrices lead to similar Cq ^ 4.5 
that varies little with k, in line with the results obtained in Section |II-D (Figure |2]-(c)). For random 
subsampling matrices Co slowly increases, almost linearly, and is equal to 5.5 for a typical value k = 10, 
about 20% larger than that of Gaussian and Bernoulli matrices. The small Co values indicate that the SCS 
reconstruction error is tightly upper bounded by a constant times the best k-term approximation error. 



— Gaussian 
—Bernoulli 
^Subsampling 




(a) 



9 

8.5 
8 

7.5 
7 
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— Gaussian 
—^Bernoulli 
-^Subsampling 




(b) 



Fig. 3. The MSB null space property constant Cq ^39) of Gaussian, Bernoulli, and random subsampling matrices, as a function 
of M, with a fixed ^ = 10 (left), and of k with M = k (right). The signal dimension is N = 64. 



From Corollary [2] and Theorem [6j we obtain the next concluding Theorem, which shows that for any 
sensing matrix, the error of Gaussian SCS is upper bounded by a constant times the best fc-term linear 
approximation with probability one, and the bound constant can be efficiently calculated. 

Theorem 7. Assume x ~ ^(0,S). Let be an M x N sensing matrix and A the optimal and linear 
decoder ([5]). Then <I> satisfies the MSB instance optimality of order k(\l) with constant Co = 4 ( 1+ Z^/f /a/f ), 
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and given in ( 37 1, and K = { 1 , . . . , fc} 



Theorem [7} together with the performance comparison of linear and nonlinear approximation for 



Gaussian signals described in Section II-C show that for signals following a Gaussian distribution with 
fast eigenvalue decay, the average error of SCS using k measurements is tightly upper bounded by that 
of the best k-temv approximation. 

III. Compressed Sensing Model Selection with GMMs 

Section [n] shows tight error bounds of SCS for signals following a Gaussian distribution with fast 
eigenvalue decay. A single Gaussian distribution, however, is too simplistic for modeling most real signals. 
Assuming multiple Gaussian distributions and that each signal follows one of them, Gaussian mixture 
models (GMMs) provide more precise signal descriptions. It has been shown that algorithms based on 
GMMs lead to results in the ballpark of the state-of-the-art in various signal inverse problems, for different 
types of real data including images and ranking score matrices ll24l . ll36ll . GMMs have also been used to 
model color distributions ll32ll and for clustering ll20l . among many satisfactory applications with these 
models. 

This section first introduces a piecewise linear decoder for GMM-based SCS, which essentially 
consists of estimating a signal using each Gaussian model included in the GMM and then selecting 
the best model. At the heart of the GMM-based SCS decoder is the model selection. The rest of the 
section analyzes the accuracy of the model selection in terms of the GMM properties and the number of 
the measurements. As correct Gaussian models are selected, the SCS performance bounds described in 
Section |IT] apply. 

A. Piecewise Linear Decoder 

GMMs describe signals with a mixture of Gaussian distributions. Assume there exist / Gaussian 
distributions {-^ [i^j \<,j<,j , parametrized by their means jXj and covariances Zj. To simplify the 
notation, we assume without loss of generality that the Gaussians have zero means jXj = 0, as one 
can always center the signals with respect to the means. GMM assumes that each signal x € is 
independently drawn from one of these Gaussians with an unknown index j € [1,/], whose probability 
density function is 

^""- (2.r/'iE;r/2 ""'(-k^J''')- <*» 
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To decode a measured signal y = Ox, the GMM-based SCS decoder estimates the signal x and selects 
the Gaussian model j by maximizing the log a-posteriori probability 

{%!) = argmaxlog/(x|y,Z^). (41) 



(41 1 is calculated by first computing the linear MAP decoder (|2]) using each of the Gaussian models, 

x^- = Aj (<I>x) = lyO^ (<I>E;<I>^ ) - ^ (Ox) , VI < j < 7, (42) 



and then selecting a best model j that maximizes the log a-posteriori probability among all the models [,36J 

/= arg max^-^ (log +x]i:j%^ , (43) 
whose corresponding decoder Aj implements a piecewise linear estimate: 

x = xj = Aj(Ox). (44) 

The model selection ([43]l is at the heart of the GMM-based SCSj^ To better understand it, we 
concentrate next in a simple case, where the GMM involves 7 = 2 Gaussian distributions ./K(0,Ei) and 
^{0,1.2) that have the same "shape" and "size", but different "orientation," i.e., the two covariance 
matrices have the same eigenvalues, but different PCA bases: 

ri=BiSB[ and r2 = B2SB[, (45) 

with Bi and B2 the PCA bases of the two Gaussian distributions, and S = diag(Ai, . . . , Aat) a diagonal 
matrix, whose diagonal elements Ai > A2 > . . . > A^r are the sorted eigenvalues. It follows directly that 
iZil = |r2|- This will be used next. 

B. Oracle Model Selection 

Let us first study the model selection in an oracle situation, where the underlying signals x are assumed 
to be known and, without loss of generality, to follow the first Gaussian distribution x ./K(0,ri). Recall 



that iZil = |r2| is assumed. The probabiUty of correct oracle model selection ( [43] ) that assigns x to the 
first Gaussian distribution c/K(0,ri), 



I^=f /i(x)Jx= /■sign(x%-ix-x^rrix)/i(x)^/x, 



(46) 



■'Correct model/class selection from compressed measurements is at the core of numerous applications beyond signal 
reconstruction, see for example |15 | and references therein. 
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where fi (x) 



{27t)'^/^\T, \1 I )' ^^^^ studied as a function of the relationship between Bi 

and B2, the decay rate of the eigenvalues, and the signal dimension A^. 



1) KL Divergence: To better understand ( [461 ), l^t us first check the KuUback-Leibler (KL) divergence 
from the first Gaussian distribution to the second 



Dkl = \j (x^r2^x-x^rrix)/i(x)cfx 



1 



Tr(i:2 Ii - I 



1 



2 "^^2 -^-^N) 2' 

where Ia? denotes the N x N identify matrix, and the second equality holds since ^[x^Ax] = Tr(Ar) if 
X ^(0,r) |[30l . Comparing ( |47] ) and ( |46l ), we observe that Dkl is monotonic relative to P°. Analyzing 



(Tr(r2%)-A^), 



(47) 
(48) 



the behavior of Dkl as a function of the two Gaussians thus helps to understand that of P° 



Inserting (45 1 into (48 1 leads to 



Dkl = ^(Tr(B2S-iB[BiSB[) -A^) = ^(Tr(CSC^S-i) -A^), 



(49) 



where C = B2 B 1 , and the second equality follows from the cyclic permutation invariance property of 
the trace Tr(ABC) = Tr(CBA). Note that C is an orthogonal matrix: C^C = In. Maximizing Dkl with 
respect to Bi and B2 is therefore equivalent to maximizing Tr(CSC^S-') with respect to C. The following 
lemma shows that in dimension two, Dkl is maximized when the first principal directions of the two 
Gaussians are orthogonal, and moreover, the maximum divergence increases as the Gaussians become 
more anisotropic. 

Lemma 2. Let Bi and B2 be respectively the PCA bases fZi = BiSB^ and £2 = B2SB2) of two centered 

\ X ~ 

2D Gaussian distributions ^(0,ri) and ^(Q,L2), and S = „ . 

U A2 



with Ai > and A2 > their 



common eigenvalues. The KL divergence from the first Gaussian distribution to the second (47 1 has a 
maximum value 



Dmax 
KL 



which is obtained when B^Bi 



1 

1 



Let the determinant of the covariance matrices |£i| = IZ2I = X1X2 further be assumed given. Then 
D^f- is minimized as X\ = X2, and it increases as the ratio between X\ and X2 increases. 



sin Q cos d 



Proof: The first part of the lemma can be easily checked by maximizing Dkl in ( [491 ) with respect 

cos B — sin 

to the 2D orthogonal matrix C = Bj Bi and writing C 



The second part is verified 
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via a direct observation of (fSOll. 



Figure |4]-(a) plots Dkl as a function of the angle between the first principal components of the two 
2D Gaussians going from 5° to 90°, with different eigenvalue ratios A1/A2 from 5 to 100. As indicated 
by Lemma[2j given A1/A2, Dkl increases as Q increases. At a given d, larger A1/A2 leads to larger Dkl- 



that 



The analysis in higher dimension is more difficult, however, one can check via a greedy optimization 

1 

••• •■■ 1 



B2B1 



1 

1 







(51) 



with ones along the anti-diagonal, and zeros elsewhere, gives a local maximum of (|49]l. In other words, 
the two Gaussians being "orthogonal" one another, i.e., the alignment of the first principal component of 
one Gaussian to the last principal component of the other, the second principal component of the former 



to the second to last principal component of the latter, and so on, leads to a local maximization of (|49 1 
This can be observed by inserting 

Cii ... Cm 



Cni 



Cnn 



in (49 1, which gives 



1 ^ 1 

^ m=l „=i 



N 



D 



KL ■ 



•2 

inn 



(52) 



A greedy maximization of with respect to C is calculated by scanning C row by row from bottom to 
top, observing that l/X,„ decreases as m goes from to 1, and at each m-th row scanning C,nn from left 



to right, observing that A„ increases as n goes from 1 to A^, taking into account the constraint C^C = I 



A similar observation of ^52^ shows that when Dkl is at the local maximum with C equal to ([51 
value increases as the eigenvalues decay faster from Ai to Xn- 



-N- 



, Its 



2) Correct Model Seletion Probability: The probability of correct oracle model selection P° (46 1 is 



now evaluated via Monte Carlo simulations. Figure |4]-(b) plots P° as a function the angle Q between 
the first principal components of the two 2D Gaussians going from 5° to 90°, with different eigenvalue 
ratios A1/A2 from 5 to 100. As illustrated in Figure |4j P° shows a behavior similar to the KL-divergence 
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Dkl as a function of Q and of A1/A2: Given A1/A2, P° increases as d increases; at a given 6, larger 
A1/A2 leads to larger P°. In contrast to Dkl, whose value is roughly proportional to A1/A2 (as X\ » A2), 
P° presents a saturation effect: A1/A2 values larger than about 40 lead to comparable P° that increases 
rapidly as a function of d, converging to a high value around 0.9; for A1/A2 smaller than about 40, on 
the other hand, P° reduces quickly as A1/A2 shrinks towards 1. 




(a) (b) 



Fig. 4. (a) The KL-divergence \A1) between two 2D Gaussians, as a function the angle between the first principal components 
of the two Gaussians going from 5° to 90°, with different eigenvalue ratios A1/A2 from 5 to 100. (b) The same for the probability 
of correct oracle model selection P" \A6\ . 



Figure |5] shows the probability of correct oracle model selection P° ( |46l ) in higher dimensions, 
under the condition that (BTl) holds, i.e., the two Gaussians are "orthogonal." A power decay of the 



eigenvalues ( [TT] ) is assumed in the Monte Carlo simulations. In different signal dimensions from 2 to 
20, P° as a function of the eigenvalue decay parameter a is plotted. For a given dimension, P° increases 
as a increases, i.e., as the eigenvalues decay faster so that the Gaussians are more anisotropic. It is 
important to remark that, with the same a, P° rapidly increases as the signal dimension increases, 
which shows that anisotropic Gaussians with their energy concentrated in the first few dimensions are 
more separate in higher dimension. 



C. Model Selection and Signal Reconstruction 



In SCS, the model selection ( [43] ) is calculated with the decoded signals ( [42| ) and not the ideal ones. 
Assume without loss of generality that the signals follow the first Gaussian distribution x ~ ./f (0,ri). 
This section checks via Monte Carlo simulations the probability of correct model selection ( |43] ) calculated 
with the decoded signals xi = AiOx and X2 = A2OX, 
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Fig. 5. The probability of correct oracle model selection P° l|46j between two Gaussians, as a function of the eigenvalue decay 
parameter a from 1 to 5, for different signal dimensions A' from 2 to 20. The two Gaussians satisfy \5\) . 



where the expectation is with respect to <I> if one random <I> is independently drawn for each x. We also 
investigate the MSE of the resulting signal reconstruction, 



L , , , ||x-xi||2/i(x)^/x+ / ||x-X2||i/i(xMxy (54) 

as a function of the number of sensing measurements M and the properties of the Gaussian distributions. 



^x,(ci>)l|x~x||^ = {E^) 



Figure [6] shows the probability of correct model selection Pc ([53]) and the MSE of signal reconstruc- 
tion ([54]) as a function of the number of measurements M and the signal dimension A^. Figure [6]-(a) plots 
Pc as a function of M going from 1 to A^, with different values from 2 to 15, assuming that ([STJ 



holds, i.e., the two Gaussians are "orthogonal." A power decay model of the eigenvalues ([TT| with a 
typical decay parameter a = 3 is assumed in the simulations. A random Gaussian matrix realization O 
is drawn independently to sense each signal. As expected, P^ increases as M goes from 1 to A^, i.e., as 
more measurements are dedicated. The signal dimension A^ plays an important role. With only M = 1 
measurement, the model selection is uniformly random {Pc ~ 0.5), independent of the signal dimensions 
A^. At an extremely low dimension N = 2, even with M = N measurements (which leads to perfect signal 
reconstruction, as if in the "oracle" case described in Section III-B i, Pc remains lower than 0.8. |^When 
A^ goes higher, Pc rapidly increases converging towards 1 as M increases. After A^ stands above a certain 
value (about 10 in this example, note that for the image examples in the next section A^ = 64), Pc converges 
very close to 1 as far as M reaches a fixed value (about 8) independent of A^. This indicates that accurate 
model selection can be achieved with very low sampling rates M/N, given that the energy of the signals 



'^We observe that a mistake in the model selection will not necessarily lead to a mistake in the reconstruction, e.g., flat image 
patches can often be recovered by multiple different models. 
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is concentrated in the first few principal dimensions. In signal sampling, one is more interested in the 
signal reconstruction error than model selection. Figure [6]-(b) similarly shows the MSB of the decoded 



signals ( |54| ) (normalized by the ideal signal energy). The MSB decreases as M increases, and it goes to 
as M = A^. At high dimensions (over about 10), almost perfect signal reconstruction is obtained as 
far as M reaches a fixed value (about 8). 
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Fig. 6. (a) The probability of correct model selection \53\ as a function the of the number of measurements M from 1 to the 
signal dimension A', with A' going from 2 to 15. (b) The same for MSB ([54j (normalized by the ideal signal energy) of the 
decoded signals. 



Similarly, Figure |7] plots the probability of correct model selection Pc ( [53] ) as well as the MSB of 
the decoded signals ([54]) (normalized by the ideal signal energy), as a function of the measurements M 
going from 1 to the signal dimension = 10, with different eigenvalue decay parameter a from 1 to 5. 
As a increases, i.e., as the eigenvalues decay faster, and MSB respectively converge to 1 and at a 
faster rate as M goes from 1 to A^. 
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Fig. 7. (a) The probability of correct model selection l[53j, as a function of the number of measurements M from 1 to the 
signal dimension A' = 10, with different eigenvalue decay parameter a from 1 to 5. (b) The same for MSB ^54) (normalized by 
the ideal signal energy) of the decoded signals. 



In summary, this section shows that the accuracy of the Gaussian model selection ( [43] ) in GMM-based 
SCS is influenced by a number of factors including the geometry of the Gaussian distributions in the 
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GMM, the signal dimension, and the number of sensing measurements. More accurate model selection 
is obtained as the Gaussians distributions are more "orthogonal" one another, as each of the Gaussians 
is more anisotropic, as the signals are in a higher dimension given that the energy of the signals are 
concentrated in the first few dimensions, and as the number of sensing measurements increases. 

IV. SCS WITH GMM - Algorithm and Experiments 

The GMM-based SCS decoder described in Section ITlI-AI assumes that the means and the covariances 
of the Gaussian distributions {^{l-ij,^j)}i<j<j in the GMMs are known. However, in real sensing 
applications, these parameters are unavailable. Following |36|, this ection presents a maximum a posteriori 
expectation-maximization (MAP-EM) algorithm 1 3 1 that iteratively estimates the Gaussian parameters and 
decodes the signals. GMM-based SCS calculated with the MAP-EM algorithm is applied in real signal 
sensing, and is compared with conventional CS based on sparse models. 

A. MAP-EM Algorithm 

The MAP-EM algorithm is an iterative procedure that alternates between two steps: 

1) E-step: Assuming that the estimates of the Gaussian parameters {(/i;,^;)}i<;<y are known 
(following the previous M-step), the E-step calculates the MAP signal estimation and model selection 



for all the signals, following (41)-(44l 



2) M-step: Assuming that the Gaussian model selection j and the signal estimate x are known 
for all the signals (following the previous E-step), the M-step estimates (updates) the Gaussian models 

{{lljXj)}i<j<J- 

Let X,, y,-, X/ and respectively denote the /-th signal in the collection, its coded version, its estimate, 
and its estimated Gaussian model index, !</</. Let 'loj be the ensemble of the signal indices / that are 
assigned to the ^-th Gaussian model, i.e., = {i : ji = j}, and let be its cardinality. The parameters of 
each Gaussian model are estimated with the maximum likelihood estimate using all the signals assigned 
to that Gaussian model, 

(A;-'^;) =argmaxlog/({x,},e<^.|^y,ry). (55) 



With the Gaussian model (40 1 , it is well-known that the resulting estimate is the empirical estimate 
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The computational complexity of the MAP-EM algorithm is dominated by the matrix inversion 
(<I>ry<I>^)^^ in ( |42l ) in the E-step. It can be implemented with flops through a Cholesky fac- 

torization Q. With J Gaussian models, the complexity per iteration is therefore dominated by JM^ /?> 
flops. 

As the MAP-EM algorithm described above iterates, the MAP probability of the observed signals 
/({x/}i<,</|{y,}i<,</, {/iy,ry}i<y<7) always increases. This can be observed by interpreting the E- and 
M-steps as a coordinate descent optimization ll22l . 

The algorithm initialization and the number J of Gaussians in GMM can be selected according to 
the type of signals of interest. For sensing natural images, a geometry-motivated initialization as detailed 
in ll36ll will be applied in the experiments. 

B. Experiments 

The GMM-based SCS is applied in real image sensing, and is compared with conventional CS based 
on sparse models. Following standard practice, an image is decomposed into ^/N x ^/N = 8x8 local 
patches {x,}i</</ (an image patch is reshaped to and considered as a vector) 121, ll26ll . ll36l . which are 
assumed to follow a GMM ||36ll . SCS samples each patch y, = O/x,, with a possibly different for 
each X/. The decoder is implemented with the MAP-EM algorithm, initialized with / = 19 geometry- 
motivated Gaussian models, each capturing a local direction ll36l . The algorithm typically converges 
within 3 iterations. No database is used, and all the parameters and reconstruction are learned from the 
compressed sensed image alone. 

The dictionary for conventional CS is learned with K-SVD from 720,000 image patches, extracted 
from the entire standard Berkeley segmentation database containing 300 natural images ||28l. In image 
estimation and sensing, learned dictionaries have been shown to produce better results than off-the-shelf 
ones O, |[T9l . ||26l . The decoder is calculated with the l\ minimization |[34l implemented in |[25l . Three 
standard images Lena (512 x 512), House (256 x 256), and Peppers (512 x 512), as illustrated in Figure [8] 
are used in the experiments. 

Figure [9] (a) shows the sensing performance on about 260,000 (sliding) patches, regarded as signals 
X/, extracted from Lena. The PSNRs generated by SCS and CS using Gaussian and random subsampling 
sensing matrices, one independent realization for each patch, are plotted as a function of the sampling 
rate M/N. At the same sampling rate, SCS outperforms SC. The gain increases from about 0.5 dB at 
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Fig. 8, 



From left to right. Three standard images used for the experiments: Lena, House, and Peppers. 
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Fig. 9. (a) PSNR (dB) vs samphng rate for SCS and CS using Gaussian and random subsampling sensing matrices on image 
patches extracted from Lena, (b) PSNR (dB) vs sampling rate for SCS and CS using Gaussian sensing matrices on image patches 
extracted from House and Peppers. 

very low sampling rates {M/N 0.1), learning a GMM from the poor-quality measured data being more 
challenging, to more than 3.5 dB at high sampling rates {M/N ^ 0.5). (SC using an "oracle" dictionary 
learned from the ideal Lena itself, undoable in practice, improves its performance from 0.2 dB at low 
sampling rates to 1.3 dB at high sample rates, still lower than SCS.) For both SCS and CS, Gaussian and 
random subsampling matrices lead to similar PSNRs at low sampling rates {M/N < 0.25), and at higher 
sampling rates Gaussian sensing gains by about 0.5 dB. Recall that SCS is not just more accurate and 
significantly faster, but also uses only the compressed image, while conventional CS uses a pre-learned 
dictionary from a large database. 

Figure |9] (b) further compares SCS with CS on sliding patches, regarded as signals, extracted from 
Peppers (260,000 patches) and House (62,000 patches). One independent Gaussian matrix reaUzation 
is applied to sense each patch. Similar results as on the patches from Lena are observed. At the same 
sampling rate, SCS outperforms SC. The gain is smaller (about 1 dB) at very low sampling rates {M/N ^ 
0.1), and becomes substantial (about 3 dB) at high sampling rates {M/N ^ 0.5). 
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Figure 10 illustrates some typical patches with geometry. The ground-truth patches are shown in the 
first row, and the patches reconstructed by conventional CS and SCS, all sensed with Gaussian matrices 
at a sampling rate M/N = 1/4, are respectively illustrated in the second and the third row. Both CS and 
SCS lead to accurate reconstruction in uniform regions. SCS outperforms CS on the more geometrical 
parts, and the improvement is significant on the fine contours (the 2nd, 3rd and 7th patches). 
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Fig. 10. Some typical 8x8 patciies witli geometry. First row: ground-trutli patches. Second and third rows: patches reconstructed 
by conventional CS and SCS respectively, all sensed at a sampling rate M/N =1/4 with Gaussian matrices. 



In most image sensing applications, one is interested in reconstructing whole images instead of 
individual patches. Aggregating non-overlapped patches to a whole images produces block artifacts. 



as illustrated in Figure 11 It is well known that averaging overlapped reconstructed patches not only 
removes the block artifacts, but also considerably improves the image estimation ||2l, ||26l . |[36l . However, 
compressed sensing only allows sensing non-overlapping patches, since sensing overlapping patches 
would dramatically increase the sampling rate. Nevertheless, overlapped reconstructed patches are com- 
putable if the sensing operators, performed on non-overlapped patches, are random subsampling matrices, 
which are diagonal operators (one non-zero entry per row). (The reconstruction is then equivalent to 
solving an inpainting problem 0, ll36ll .) Figure [TT] shows some typical regions in Lena. The overlapped 



reconstruction, which further supports the search for performance on average as in the proposed SCS, 



removes the block artifacts and significantly improves the reconstructed image. Figure 12 plots the 
PSNRs on the whole image Lena generated by SCS using random subsampling matrices and overlapped 
reconstruction are plotted, in comparison with those obtained using Gaussian sensing matrices and non- 
overlapped reconstruction, at different sampling rates. The former improves from about 3.5 dB, at low 
sampling rates, to 1.5 dB, at high sampling rates, at a cost of = 64 times computation. 
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Fig. 11. From left to right. Zoomed crops from Lena, reconstructed images by SCS using Gaussian sensing matrices and 
non-overlapping reconstruction, and by SCS using subsampling random matrices and overlapping reconstruction. The image is 
sensed on non-overlapped patches at a sampling rate of M/N = 0.25. Local PSNRs are reported. 
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Fig. 12. PSNR (dB) vs sampling rate (on the whole image Lena), for SCS using Gaussian sensing matrices with non-overlapping 
reconstruction, and subsampling random matrices with overlapping reconstruction. 



V. Conclusion 

Statistical compressed sensing (SCS) based on statistical signal models has been introduced. As 
opposed to conventional compressed sensing that aims at efficiently sensing and accurately reconstructing 
one signal at a time, SCS deals simultaneously with a collection of signals. While CS assumes signal 
sparse models, SCS is based on a more general Bayesian assumption that signals follow a statistical 
distribution. SCS based on Gaussian models has been investigated in depth. It has been shown that based 
on a single Gaussian model, with Gaussian or Bernoulli sensing matrices of (k) measurements, consid- 
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erably smaller than the 6^ {k\og{N / k)) required by conventional CS, where is the signal dimension, and 
with an optimal decoder implemented with linear filtering, significantly faster than the pursuit decoders 
applied in conventional CS, the error of SCS is tightly upper bounded by a constant times the best fc-term 
approximation error, with overwhelming probability. The failure probability is also significantly smaller 
than that of conventional CS. Stronger yet simpler results, derived from a new RIP in expectation property 
further show that for any sensing matrix, the error of Gaussian SCS is upper bounded by a constant times 
the best fc-term approximation with probability one, and the bound constant can be efficiently calculated. 
For Gaussian mixture models (GMMs) that assume multiple Gaussian distributions, and that each signal 
follows one of them with an unknown index, a piecewise linear estimator is introduced to decode SCS. 
The accuracy of model selection, which is at the heart of the piecewise linear decoder, is analyzed in 
terms of the properties of the Gaussian distributions and the number of the sensing measurements. A 
MAP-EM algorithm that iteratively estimates the Gaussian models and decodes the compressed signals is 
presented for GMM-based SCS. Applications of GMM-based SCS in real image sensing has been shown. 
Comparing with conventional CS, SCS leads to improved results, at a considerably lower computational 
cost. 

This line of research opens numerous new questions in compressed sensing, from the formal de- 
velopment of bounds in the compressed domain model selection (see also ||9l), to the study of model 
parameters estimation in the compressed domain and the extension of the results here reported to non- 
Gaussian distributions. The work here reported also shows that compressed sensing is significant beyond 
sparse signal models, generating the natural question of what type of models can benefit from such 
sensing scenario. 
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